Carnegie Mellon University
Overview
= Test-centric safety assurance 
e E.g., for autonomous vehicles 
e But testing alone is too expensive, so... 


= Bootstrapping schemes 
e Bootstrapping by miles 
e Phased deployment 
e “Probably perfect” arguments 


= Conclusion: they won't work the way you hope they will
e Driver-out “safety testing” is unsafe
e Bootstrap testing won't fix this 
e Good for identifying common scenarios
e Expensive; risk of a high profile crash 
AV Industry
Original Plan: 
ADS Technology has come to be:


Ford VSSA: https://bit.ly/3njionT


Sold Based on Safety 


Waymo VSSA: https://bit.ly/2QuYhai


We're Building 
a Safer Driver 
for Everyone 


Self-driving vehicles hold the promise
to improve road safety and offer new
mobility options to millions of people.
Whether they’re saving lives or helping
people run errands, commute to work,
or drop kids off at school, fully
self-driving vehicles hold enormous potential
to transform people’s lives for the better.


Safety is at the core of Waymo's 
mission—it’s why we were founded 
over a decade ago as the Google 
Self-Driving Car Project. 
How Safe Is “Safe?”


[Book cover: SAFE ENOUGH? Measuring and Predicting Autonomous Vehicle Safety]


= ~100M miles/fatal mishap for human drivers (US) 
e 28% Alcohol impaired/Driving Under Influence 
e 26% Speed-related 
e 9% distracted driving 
e 2% drowsy ...
(total > 100% due to multiple factors in some mishaps)
= Fully functional drivers are much better
= New AV has better safety than 10+ year old “average” car


> Better than an unimpaired, undistracted driver in new car 
e (“Safe Enough” is complicated — but a different talk.) 
Safety Via Brute Force Road Testing (?)


= Say 200M miles/critical mishap... 


e Test 3x-10x longer than mishap rate
=> Need 2 Billion miles of testing 


[WolframAlpha screenshot: miles of roads, 4.03 million mi]


= That's ~50 round trips
on every road in the world 
e With fewer than 10 critical mishaps 


e Even more testing if you find a 
defect and redo some testing 


= Required scale is infeasible
e Highly scalable
e “All models are wrong; some are useful.” (George Box) 
e “Simulations are doomed to succeed.” 


= Still need real world miles to validate the simulations
AV Industry
Improved Plan: 


Phase 1    Phase 2    Phase 3


Profit 
Bootstrapping To The Rescue (maybe)


1. Observe safety driver stops intervening 
2. Remove safety driver 


3. Crash-free history predicts 
crash is unlikely for a small window 


4. Drive for small window with no crash 
5. Repeat Steps 3 & 4, with growing window size 


= Incremental approach to road testing assurance


= Variations 
e Pure mileage-based bootstrapping 
e Phased deployment, slow update roll-out 
e Combine with belief in probably perfect design

© 2022 Philip Koopman
The Demo Question


Starsky retrofitted Volvo truck 


= Hypothetically: 10K miles with safety driver
e Zero safety driver interventions
e 95% confidence MTBF > 3338 miles

= Need “driver out” demo for funding milestone
e Demo 10 miles without driver 
e Company fails if you don't demo on time 


= What are odds of a crash on this demo? 
e $R(t) = e^{-\lambda t}$ for $\lambda = 1/3338$, $t = 10$ => 99.7% no crash

= Do you do the demo?
e If there is no crash in the demo, was that safe?

https://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator
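The arithmetic on this slide can be reproduced with a short sketch. It assumes a constant failure rate (exponential model) and a failure-free test, in which case the one-sided lower bound on MTBF is the test mileage divided by -ln(1 - confidence). The function names are illustrative and are not taken from the calculator cited on the slide.

```python
import math

def mtbf_lower_bound(test_miles: float, confidence: float) -> float:
    """One-sided lower bound on MTBF after a test with zero failures.

    Assumes exponentially distributed miles between failures.
    """
    return test_miles / -math.log(1.0 - confidence)

def prob_no_crash(mission_miles: float, mtbf: float) -> float:
    """R(t) = exp(-lambda * t), with lambda = 1 / MTBF."""
    return math.exp(-mission_miles / mtbf)

bound = mtbf_lower_bound(10_000, 0.95)
print(f"95% lower bound on MTBF: {bound:.0f} miles")           # about 3338
print(f"P(no crash in 10 miles): {prob_no_crash(10, bound):.3f}")  # about 0.997
```

Note that 99.7% is the no-crash probability only if the true MTBF equals the bound; the calculation says nothing about whether taking the remaining 0.3% risk is acceptable.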
One-Off Events: Safe? Or Just Lucky?

= There is a 99.7% chance of no crash for a demo

e You run the demo... and ... no crash 

e Claim: “therefore the demo was safe” 
= What are flaws in this argument? 

e Jumped out of an undamaged airplane 

— Parachute opened, so it was perfectly safe 

e Swam with sharks ... still have all limbs 

= Is evading a hazard once “safe”?


e Getting away with taking a risk ...
is not quite the same as safety 


e Public road testing imposes risks on non-consenting road users 
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® . Carnegie 
Mileage-Based Bootstrap First Step ee 
= Example: 100K miles testing with safety driver 

e Zero safety driver interventions 


e 95% confidence MTBF > 33380 miles
— (Note: automotive often does about 70% confidence) 


= Do 100 miles of testing with no safety driver 
e $R(t) = e^{-\lambda t}$ for $\lambda = 1/33380$, $t = 100$ => 99.7% no crash


= Now you have 100,100 miles with no crash 
e 95% confidence MTBF > 33414 miles
e Notice that 33,414 > 33,380 ... hmmm ... interesting! 
— We can bootstrap our way to proving safety! 


https://reliabilityanalyticstoolkit.appspot.com/mtbf_test_calculator


Naive Bootstrap Argument

= Start with baseline testing with safety driver 
e Perhaps 1M miles? (much less than 100M miles) 

e Then remove safety driver => driverless testing


= Iteratively longer test cycles 
e Test for X miles based on crash probability 
e Each step yields bigger MTBF 
e Next step can be X+δ miles due to larger MTBF
... Math, math, math ...
= Lather. Rinse. Repeat.
e Prove you are safer than a human driver
=> Phase 3: $$$Profit$$$
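A minimal sketch of the naive bootstrap loop, under the same exponential-model assumptions as the earlier slides. The 99.7% per-window target, the 1M-mile baseline, the 200M-mile goal, and the function names are illustrative choices, and the loop optimistically assumes that no crash ever occurs.

```python
import math

def mtbf_lower_bound(crash_free_miles: float, confidence: float = 0.95) -> float:
    """Lower bound on MTBF after a crash-free history (exponential model)."""
    return crash_free_miles / -math.log(1.0 - confidence)

def naive_bootstrap(baseline_miles: float, target_mtbf: float,
                    window_no_crash: float = 0.997, confidence: float = 0.95):
    """Grow the crash-free mileage one driverless window at a time.

    Each window is sized so that R(window) equals window_no_crash when the
    true MTBF is taken to be the current lower bound.
    Returns (number of windows, total crash-free miles).
    """
    miles = baseline_miles
    windows = 0
    while mtbf_lower_bound(miles, confidence) < target_mtbf:
        bound = mtbf_lower_bound(miles, confidence)
        window = -math.log(window_no_crash) * bound  # miles in the next window
        miles += window
        windows += 1
    return windows, miles

windows, miles = naive_bootstrap(1e6, 200e6)
print(windows, f"{miles / 1e6:.0f}M miles")
```

Each window adds only about 0.1% to the accumulated mileage, so reaching a 200M-mile bound from a 1M-mile baseline takes thousands of consecutive crash-free windows. Starting from 100K miles instead, the first window comes out at about 100 miles, consistent with the first-step example.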
What Happens If You Get A Crash?
= Need to test longer if there is a crash
e For 200M miles @ 95% confidence
— ~600M miles of testing required for no crash
— With 1 crash: ~949M miles of testing
— 2 crashes: ~1259M miles
— 5 crashes: ~2103M miles

[Image: TuSimple crash, April 2022]


= Probability of crashes is high 


e At 200M miles MTBF, probability of crash by 600M miles is 95%
e The math is not in your favor here ... luck is required 
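The mileage figures on this slide can be reproduced, to rounding, by treating crashes as a Poisson process and finding the test length at which seeing so few crashes would be only 5% likely if the true MTBF were exactly the 200M-mile target. The sketch below uses only the standard library and a simple bisection; the function names are illustrative.

```python
import math

def poisson_cdf(r: int, mean: float) -> float:
    """P(N <= r) for a Poisson-distributed count N with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, r + 1):
        term *= mean / k
        total += term
    return total

def required_test_multiple(crashes: int, confidence: float = 0.95) -> float:
    """Test length, in multiples of the target MTBF, needed to claim the
    target at the given confidence despite observing `crashes` crashes."""
    lo, hi = 0.0, 100.0
    for _ in range(100):  # bisection on the expected number of crashes
        mid = (lo + hi) / 2
        if poisson_cdf(crashes, mid) > 1.0 - confidence:
            lo = mid  # too few expected crashes: test must be longer
        else:
            hi = mid
    return hi

target_mtbf = 200e6  # miles
for r in (0, 1, 2, 5):
    miles = required_test_multiple(r) * target_mtbf
    print(f"{r} crashes: ~{miles / 1e6:.0f}M miles")

# Chance of at least one crash within 600M miles if the true MTBF is 200M:
print(f"{1 - math.exp(-600e6 / target_mtbf):.2f}")  # about 0.95
```

The same factor of roughly 3 (that is, -ln 0.05) sets both the zero-crash test length and the 95% chance of a crash during that test.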
Argue That Crash Didn't Count
= That crash does not count because {reasons}
e It was the other driver's fault
— A crash is still a crash
e It was a freak/black swan occurrence
— A crash is still a crash
e It was a near miss instead of a crash
— Near misses are not reported to regulators
= Argue that bug was fixed
e Impact analysis performed
— Do you believe in the 0% fault reinjection rate fairy?
e Surely that was the last defect in the system. (Really?)

[News headline: US safety regulators open special investigation into Cruise AV crash]
Pure Bootstrap Safety Issues

= No expectation of safety up front
e Confirms if system happens to be safe
e Does not somehow make system safe

= Are repeated cycles of 99.7% “safe” ethical?
e Insufficiently low bounds on mishap rate

= Find out system is unsafe via an early crash
e Bootstrapping in effect justifies one “free” fatality

= There is no such thing as uncrewed AV safety testing
e Really it is just deployment of unproven technology
— Pony.ai lost permit in May 2022 (empty vehicle crash)

[Screenshot: “As of November 19, 2021, DMV has issued Autonomous Vehicle Driverless Testing Permits to the following entities:” (list includes CRUISE LLC; NURO, INC; ...)]




= Introduce new versions slowly /
initially operate with small pilot fleet
e Reduces chance of large fleet having
an early catastrophic failure
e Said to be “safe” due to reduced risk
— A variant on the one-off exposure fallacy
— Reduces risk of multiple concurrent early mishaps
— Risk reduction is not safety ... different talk

= Amounts to a bootstrap safety argument
e Safety risk presented to individuals is unchanged
— Loss events could still happen at unacceptable rate


Probably Perfect System
Bishop, Povyakalo, Strigini 2021: https://arxiv.org/pdf/2110.10718.pdf


= Take credit for “probably perfect”
e E.g., 90% probability it is safe 
e Allows faster bootstrapping 


= Still might deploy unsafe system 
e E.g., 10% probability it is unsafe 
— Accumulated failure probability adds up quickly! 
e Argument destroyed along with first crash 
— Any early crash falsifies “probably perfect” hypothesis 
e Bayesian prior of “we think it is probably perfect” ...
... is still an early deployment of a “possibly unsafe” system
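A deliberately simplified sketch of the “probably perfect” reasoning, not the method of the cited paper. It assumes only two hypotheses: the system is perfect (never crashes), or it is flawed with a constant failure rate. The 90% prior comes from the slide; the 1M-mile MTBF for the flawed case and the function names are hypothetical values chosen for illustration.

```python
import math

def posterior_perfect(prior_perfect: float, crash_free_miles: float,
                      mtbf_if_flawed: float) -> float:
    """P(perfect | no crash so far), by Bayes' rule over two hypotheses."""
    survive_if_flawed = math.exp(-crash_free_miles / mtbf_if_flawed)
    return prior_perfect / (prior_perfect + (1.0 - prior_perfect) * survive_if_flawed)

def prob_crash_by(prior_perfect: float, miles: float, mtbf_if_flawed: float) -> float:
    """Overall chance of at least one crash within `miles` of deployment."""
    return (1.0 - prior_perfect) * (1.0 - math.exp(-miles / mtbf_if_flawed))

prior = 0.90           # from the slide: 90% probability it is safe
flawed_mtbf = 1e6      # hypothetical: MTBF in miles if the system is flawed
for miles in (1e5, 1e6, 5e6):
    print(f"{miles:>9.0f} miles: "
          f"P(crash) = {prob_crash_by(prior, miles, flawed_mtbf):.3f}, "
          f"P(perfect | no crash) = {posterior_perfect(prior, miles, flawed_mtbf):.3f}")
```

In this model crash-free miles raise confidence in perfection, but the chance of a crash approaches the full 10% prior mass of “flawed” as mileage accumulates, and a single crash sets the probability of “perfect” to zero.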
SPIs and Lifecycle Feedback


= SPI: direct measurement of safety case claim failure 
e Independent of reasoning (“claim is X ... yet here is ~X”) 


= A falsified safety case claim:
e Safety case has some defect
= Root cause analysis might reveal:
e Product or process defect
e Invalid safety argument
e Issue with supporting evidence
e Assumption error

= Continual safety case improvement

[Diagram: CLAIM (“Is claim false?”), ARGUMENT, Sub-CLAIM 1A, Sub-CLAIM 1B]
SPI-Based Feedback Approach


= Safety Case argues acceptable risk 
e SPIs monitor validity of safety case

[Diagram labels: SOTIF; triggering events; safety monitor; hazard analysis]
Detailed SPI Definition
= An SPI is a metric supported by evidence that uses a
threshold comparison to condition a safety case claim. 
e Metric: measurement of performance, design quality, process 
quality, operational procedure conformance, etc. 
e Threshold: acceptance test on metric value 
— Often statistical (e.g., fewer than X events per billion miles) 
e Evidence: data used to compute the metric 
e Condition a claim: threshold violation falsifies a specific claim 
— Argument for claim is (potentially) proven false by SPI 
e Definition ties the metric directly to the safety argument 
= SPI violation: part of a safety case has been falsified 
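The definition above maps naturally onto a small data structure. The sketch below is one possible encoding, not a standard API: the class, the field names, the example claim text, and the threshold of 50 events per billion miles are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Evidence: (event count, miles driven) for each reporting period.
Evidence = List[Tuple[int, float]]

def events_per_billion_miles(evidence: Evidence) -> float:
    """Metric: observed events normalized to one billion miles."""
    events = sum(count for count, _ in evidence)
    miles = sum(m for _, m in evidence)
    return events / miles * 1e9

@dataclass
class SPI:
    claim: str                            # safety case claim this SPI conditions
    metric: Callable[[Evidence], float]   # how the metric is computed from evidence
    threshold: float                      # acceptance limit: metric must stay below it

    def is_violated(self, evidence: Evidence) -> bool:
        """A violation means the associated claim has been falsified."""
        return self.metric(evidence) >= self.threshold

# Hypothetical example: claim text and threshold are made up for illustration.
spi = SPI(
    claim="Vehicle maintains adequate clearance to pedestrians",
    metric=events_per_billion_miles,
    threshold=50.0,
)
evidence = [(0, 2e6), (1, 3e6)]           # 1 event in 5M miles
if spi.is_violated(evidence):
    print(f"Claim falsified: {spi.claim}")
```

Keeping the claim text inside the SPI object reflects the point of the definition: a threshold violation is traceable to a specific part of the safety case rather than being a free-floating metric.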
Sketch of an AV Safety Argument
AV is safe enough to deploy because: 
= We've followed industry safety standards & strong safety culture 
= Known hazards have been mitigated

e Residual risk is acceptable at system level 
= Arrival rate of unknowns is low 


e Incidents which do not trigger 
runtime safing have low consequence 


= Safety case has good SPI coverage
= SPIs usually detect unknowns without an actual crash

e System is fixed to mitigate unknowns before likely reoccurrence
= Idea: bootstrap on surprise arrival rates & SPI improvement
Conclusions
= Bootstrap testing is an appealing but bad idea
e Pure miles — safety is just a hope 
e Slow rolling — risk reduction is not safety 


e More complex approaches: 


— Maybe(?) saves the very last testing iteration 
if playing the odds on “probably perfect” 


= Driver-out “safety testing” is unsafe
e Keep driver in until safe enough to deploy 


= Perhaps SPI bootstrapping can help 
e Bootstrap the safety case, not testing miles 